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Abstract 

Tensor decomposition is a powerfiil computational tool for multiway data anal- 
ysis. Many popular tensor decomposition approaches — such as the Tucker decom- 
position and CANDECOMP/PARAFAC (CP) — amount to multi-linear factoriza- 
tion. They are insufficient to model (i) complex interactions between data entities, 
(ii) various data types (e.g.missing data and binary data), and (iii) noisy observa- 
tions and outliers. To address these issues, we propose tensor-variate latent non- 
parametric Bayesian models, coupled with efficient inference methods, for multi- 
way data analysis. We name these models InfTucker. Using these InfTucker, we 
conduct Tucker decomposition in an infinite feature space. Unlike classical ten- 
sor decomposition models, our new approaches handle both continuous and binary 
data in a probabilistic framework. Unlike previous Bayesian models on matrices 
and tensors, our models are based on latent Gaussian or t processes with nonlinear 
covariance functions. To efficiently learn the InfTucker from data, we develop a 
variational inference technique on tensors. Compared with classical implementa- 
tion, the new technique reduces both time and space complexities by several or- 
ders of magnitude. Our experimental results on chemometrics and social network 
datasets demonstrate that our new models achieved significantly higher prediction 
accuracy than the most state-of-art tensor decomposition approaches. 



1 Introduction 

Many real-world datasets with multiple aspects can be described by tensors (i.e., mul- 
tiway arrays). For example, email correspondences can be represented by a tensor 
with four modes (sender, receiver, date, content) and user customer 
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ratings by a tenor with four modes (user, item, rating, t ime ). Given the 
tensor-valued data, traditional multiway factor models — such as the Tucker decom- 
position [21] and CANDECOMP/PARAFAC (CP) [6]— have been widely applied for 
various applications {e.g., network traffic analysis [25], computer vision [17] and so- 
cial network analysis [18, 14, 19], etc). These models, however, face serious challenges 
for modeling complex multiway interactions. First the interactions between entities in 
each mode may be coupled together and highly nonlinear. The classical multi-linear 
models cannot capture these intricate relationships. Second, the data are often noisy, 
but the classical models are not designed to deal with noisy observations. Third, the 
data may contain many missing values. We need to first impute the missing values 
before we can apply the classical multiway factor models. Forth, the data may not be 
restricted to real values: they can be binary as in dynamic network data or have ordinal 
values for user- movie-ratings. But the classical models simply treat them as continuous 
data — this treatment would lead to degenerated predictive performance. 

To address these challenges we propose a nonparametric Bayesian multiway analy- 
sis model, InfTucker. Based on latent Gaussian processes or t processes, it conducts the 
Tucker decomposition in an infinite dimensional feature space. It generalize the elegant 
work of Chu and Ghahramani [5] by capturing nonlinear interactions between differ- 
ent tensor modes. Grounded in a probabilistic framework, it naturally handles noisy 
observations and missing data. Furthermore, it handles various data types — ^binary or 
continuous — by simply using suitable data likelihoods. Although InfTucker offers an 
elegant solution to multiway analysis, learning the model from data is computationally 
challenging. To overcome this challenge, we develop an efficient variational Bayesian 
approach that explores tensor structures to significantly reduce the computational cost. 
This efficient inference technique also enables the usage of nonlinear covariance func- 
tions for latent Gaussian and t processes on datasets with reasonably large size. 

Our experimental results on chemometrics and social network datasets demonstrate 
that the InfTucker achieves significantly higher prediction accuracy than state-of-the- 
art tensor decomposition approaches — including High Order Singular Value Decom- 
position (HOSVD) [10], Weighted CP [1] and nonnegative tensor decomposition [17]. 

2 Preliminary 

Notations. Throughout this paper, we denote scalars by lower case letters (e.g. a), 
vectors by bold lower case letters (e.g. a), matrices by bold upper case letters (e.g. A), 
and tensors by calligraphic upper case letters (e.g. A). Calligraphic upper case letters 
are also used for probability distributions, e.g., A/'(/i., S). We use to represent 
the (i, j) entry of a matrix U, yi to represent the i — {ii, . . . , ik) entry of a tensor 
3^. U (g) V denotes the Kronecker product of the two matrices there. We define the 
vectorization operation, denoted by vcc(3^), to stack the tensor entries into a HfeLi 
by 1 vector The entry i = (zi, . . . ,\k) of 3^ is mapped to the entry at position j = iK + 
E^l7^fe-l)nfc+i^H-ofvec(3^)'. The mode-fc product of a tensor W € W^'^-'^'^^ 

'Unlike the usual column-wise vec-operation, our definition of vec() on matrices is row-wise, 
which avoids the use of transpose in many equations throughout this paper. 
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with a matrix U G K^x'"'-- is denoted as W xj. U and it is of size ri x . . . x r^-i x 
n X Tfc+i X . . . X vk- The corresponding entry-wise definition is 

(W Xk U)i,...,,_^jj,^^...,^ =J2w^,...^^Uj^^. (1) 

i = l 

Tensor decomposition: There are two famihes of tensor decomposition, the Tucker 
family and the CP family. The Tucker family extends bilinear factorization models to 
handle tensor datasets. For an observed if -mode tensor y E jjnix...xnK^ j-jjg general 
form of Tucker decomposition is 

3^ = WxiU(i) X2...xa'U(-^) (2) 

where W G jj^-ix-.-xrA- ^jjg ^^^g tensor, and UC") e ]]j"fcX'-*! are K latent factor 
matrices. As in [9], We collectively denote the group of K matrices as a Tucker tensor 
with a identity core U = [U*^^^ , • ■ • , U^^^] — this allows us to compactly represent the 
Tucker decomposition asy = W x U. The vector form of (2) is 

vec(W xU)^ U(i) (g) U(2) (g)...(g) U(^^ ■ veciW) (3) 

The CP family is a restricted form of the Tucker family. The entry-wise definition of 
CP is yi-^...iK — Ym=i ^i^iii ■ ■ ■ Uij^i- The alternating least square (ALS) method has 
been used to solve both Tucker decomposition and CP [9]. 

3 Infinite Tucker decomposition 

In this section we present the infinite Tucker decomposition based on latent Gaussian 
processes and t processes. The following discussion is primarily for latent Gaussian 
processes. The model derivation for latent t processes is similar to that of latent Gaus- 
sian processes. 

We extend classical Tucker decomposition in three aspects: i) flexible noise mod- 
els for both continuous and binary observations; ii) an infinite core tensor to model 
complex interactions; and iii) latent Gaussian process prior or latent t process. 

More specifically, we assume the observed tensor y is sampled from a latent real- 
valued tensor via a probabilistic noise model p(3^|A1) = niPCyil"^!)- 

We conduct Tucker decomposition for with a core tensor W of infinite size. 
To do so, we use a countably infinite feature mapping for the rows of the component 
matrix U^'^) G M"'=><'', fc = 1, . . . , if. Let u^''^ denotes the i-fli row of U('=\ A feature 
mapping : M'' — > M^" maps each u^*^^ to the infinite feature space (^(ul'^^), where 
Hq denotes the countable infinity. The inner product of the feature mapping is denoted 
as Sg) = (0(uf)),0(uf )). Let (/)M(uf)) = [0i(uf )),..., </.,(uf-')] denote the 
first r coordinates of (/()(u^'^^), W G M^o^ denote an infinite iiT-mode core tensor, and 
yV'f'') = [wiY^^^i ^ denote the first r dimensions in every mode of W. The 
infinite Tucker decomposition "A^ — W x (t)iU)" for the latent tensor can be 
formally defined as the limit of a series of finite Tucker decompositions. 

M = lim W^'^^ xi 0W(U(i)) X2...XK (t>^''\V^^^) (4) 
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where ^'^(U^) = [(^«(ui''y, . . . , ^M^'lT]^- 

As shown in the next Section, we use a latent tensor-variate Gaussian process prior 
on W and then marginaUze it out to obtain a Gaussian process over Ai. Alternatively, 
we can also use a latent tensor-variate t process prior on W and obtain a t process over 
M. 

3.1 Tensor-variate Gaussian processes 

Before formally defining the tensor-variate t process, we denote the domain of the 
mode A: by Uk, the K co variance functions by E^*-') : [/^ x ^ M, the covanance 

matrices by a Tucker tensor ^ [(S(i))"^ , . . . , {'E^^iy^] and n = nf=i "fc. 
The norm of the a tensor ||^|| is defined as ^/^^a?. Then we define tensor-variate t 
processes as follows. 

Definition 1 (Tensor-variate Gaussian Processes) Given K location sets Uk, k = 
l,...,K,letb:UiX...xUK ^ R be the mean function. M {/(u^^), . . . , u(^))|u('=) G 
Uk} is a set of random tensor variables where f : Ui x . . . x Uk ~^ R is a random 
function. For any finite sets {u[''\ . . . , ul'^^lfcLi' ^ [■^("ji^' • ■ ' ' ^ 

j^ni x...xn/f ^ where jk ~ 1, ■ • • , JT-fe, be a random tensor and B = [6(u^||\ . . . , u^^-')]vj G 
]^rtix...xn/f mean tensor 

We say Ai ^ TQV{Ai\B, {l^^'^^}^^^) follows a tensor-variate Gaussian process, 
if A4 follows a tensor-variate normal distribution: 

K 
k=l 

exp|-i||(X-6) x5-5||2|. (5) 

In this paper, we set the mean function to be zero, i.e. B — Q. Let N{v, fi, S) 
denotes a normal distribution with mean fi and covariance S. If the latent tensor J\4 
is drawn from a tensor-variate Gaussian process, then vec(A^) ^ Af{0,^), where 
^ = S^^^ (g) . . . (g) We choose the prior on the truncated core tensor W'^^^ to be 

TAf{0, {IrjfeLi)' where 1^ denotes the identity matrix. The next theorem proves that 
the limit defined in (4) is the corresponding tensor process. 

Tlieorem2 LetUk C W, and T.i''\u\''\ uf^ ) = (0W(uf <?!)W(uf ^)) beaseries 
of covariance functions. Define a multi-linear function by 

5«(u(i), . . . ,u(^)) - xi 0W(u(i)) ...Xk (fy^'Hn^''^), 

where n^'^^ G Uk. IfW^'''^ - r7V(t/, 0, {I^f^J, f/ie« gM(u(i), . . . , uW) /o/- 
lows a tensor-variate Gaussian distribution 7W(0, {Sr'^'*}^j^), and it converges to 
TGViO, {S('')}f^i) in distribution as r ^ oo. 

The proof of Theorem 4 can be found in Appendix A. 

(fe) 

Finally, to encourage sparsity in estimated u] — for easy model interpretation — 
we use Laplace prior u^'^'' ^ oc exp(— A||u^'^''||i). 
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3.2 Tensor- variate t processes 

Because of the strong relation between i-distributions and Gaussian distributions — t 
distributions can be regarded as mixtures of Gaussian distributions weighted by Gamma 
distributions, we can easily define tensor-variate t processes: 

Definition 3 (Tensor-variate t Processes) Let r{x) be the Gamma function. The set 
M follows a tensor-variate t process TTV{h',b,{'S^''^}^^^) with degree of freedom 
V > 2, if M follows tensor t distribution with the following density 



We can also prove a similar convergence result for tensor-variate Gaussian distri- 
bution. 

Tlieorem 4 //Wt'^' - TT{v, 0, {Irjf^i), then g^''\n^^\ . . . , u^^'^) follows a tensor- 
variate t distribution TT{v, 0, {Sr'^''}^-^), and it converges to tensor-variate t pro- 
cesses TTV{iy, 0, {EC'^lf^i) in distribution as r ^ oo. 

The proof of Theorem 2 follows exactly the same path as that of the convergence 
result for tensor-variate Gaussian processes. 

The above theorem shows that probabilistic infinite Tucker decomposition of Ai 
can be realized by modeling as a draw from a tensor-variate t process on the location 
vectors induced from the unknown component matrices U^'^^ Our definition of tensor- 
variate t processes generalizes matrix- variate t process defined in [26]. Theorem 2 also 
suggests a constructive definition of tensor-variate processes for general covariance 
functions. 

3.3 Noise models 

We use a noise model p{y\M) to link the infinite Tucker decomposition and the tensor 
observation y. 

Probit model: In this case, each entry of the observation is binary; that is, yi € 
{0, 1}. A probit function p{yi\mi) — $(mi)^'(l — $(r7ii))^^^' models the binary 
observation. Note that $(•) is the standard normal cumulative distribution function. 

Gaussian model: We use a Gaussian likelihood p{yi\mi) = Af{yi\mi,a'^) to 
model the real-valued observation yi. 

Missing values: We allow missing values in the observation. Let O denote the 
indices of the observed entries in y. Then we have p{yo\A4o) as the likelihood. 

Other noise models include modified probit models for ordinal regression and multi- 
class classification [2], null category noise models for semi-supervised classification 
[12]. In this paper we focus on probit and Gaussian models. 



r(|)K)t 



k 




5 



4 Algorithm 



Given the observed tensor y, we aim to estimate the component matrices U^'^^ by max- 
imizing the marginal HkeHhood p(3^|{U('^)}^]^)p({U'''''}^j^). Integrating out Ai in 
the above equation is intractable however. Therefore, we resort to approximate in- 
ference; more specifically, we develop a variational expectation maximization (EM) 
algorithm. In the following paragraphs, we first present the inference and prediction 
algorithms for both of the noise models, and then describe an efficient algebraic ap- 
proach to significantly reduce the computation complexity. Due to space limitation, we 
only describe the algorithm for tensor-variate t-distribution. The algorithm for tensor- 
variate Gaussian distribution can be derived similarly. 

4.1 Inference 

Probit noise: We follow the data augmentation scheme by Albert and Chib [2] to 
decompose the probit model into p{yi\mi) — J p{yi\zi)p{zi\mi)dzi . Let S{-) be the 
indicator function, we have 

p{yi\zi) = Siyi = l)5(zi > 0) + Siy, - 0)S{zi < 0), 
p{zi\mi) = Af{zi\mi, 1) 

It is well known that a t distribution can be factorized into a normal distribution con- 
volved with a Gamma distribution, such that 

rr(X|i^,0,{S«}f^i) = J Gami^W/2,iy/2y 

TAf{M\0,{v-'^'''S^''^ti)dv, (6) 

where TAf denotes the tensor-variate normal distribution. The joint probabihty likeli- 
hood with data augmentation is 

p{y,Z,M,r,.l^) ^ p{y\Z)piZ\M)p{M\ii,U)piiMi^)- (7) 

where p{A4\r],U) and p{r]) is the tensor-variate normal distribution and the Gamma 
distribution in (6). p{U) is the Laplace prior 

Our variational EM algorithm consists of a variational E-step and a gradient-based 
M-step. In the E-step, we approximate the posterior distribution p{Z,Ai, ri\y,U) by 
a fully factorized distribution A^, 77) = q{Z)q{M.)q{ri). Variational inference 
minimizes the Kullback-Leibler (KL) divergence between the approximate posterior 
and the true posterior. 

uim^h{q[Z)q{M)q{v)\\p{Z,M.v\yM)) ■ (8) 

The variational approach optimizes one approximate distribution, e.g., q{Z), in (8) at 
a time, while having all the other approximate distributions fixed [3]. We loop over 
q{Z), q{A4) and q{ri) to iteratively optimize the KL divergence until convergence. 
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Given q{Ai) and q{ri), the q{zi) is a truncated normal distribution 



g(zi) cxAA(E,K],l)<5(^i > 1), (9) 

F [ 1 F r 1 , (2yi - 1)AA(E, K] |0, 1) 

Eg Zi = Eg m; + — — — — — — - — . (10) 

^{{2yi - l)Eg [TOi]) 

Given q{Z) and (7(ry), it is more convenient to write the optimized approximate distri- 
bution for A4 in its vectorized form. Let Sp = S^^' (g) . . . (g) we have 

g(vec(X)) = A/'(vec(7W)|/x, T), (11) 
/X = vec(E, [A^]) = Y vec(E, [Z]) (12) 

Y = E,[77]-iSp(l + E,M-iSp)"\ (13) 
The optimized q{ri) is also a Gamma distribution: 



g(77)=Gam(r;|/3i,/32), E, [77] = ^, /? 



2 



/32- ^ . 

Based on the variational approximate distribution obtained in the E-step, we maxi- 
mize the expected log likelihood over = [U(i) , . . . , U(^)] in the M-step. 

msixEy[logpiy,Z,M,v\U)piU)]. (14) 
After eliminating constant terms, we need to solve the following optimization problem 

K 

mm f{U) = ^ — log |S(^-)| + t||E, [M] x S'^^^W^ 

k=l 

K 

+ Ttr(E;iT)+A^||Ufc|ji, (15) 

k=l 

where r = E, [77]. In the above equation (15), S^''^ = Yj^^\\]'^^\JJ'-^'>) is consid- 
ered as a function of U^, and S^^/"^ is a function of U. T and t are the statistics 
computed in the E-step, and they have fixed values. The gradient of f{U) w.r.t. to a 
scalar can be found in Appendix B. With an £1 penalty on f(U), we choose a 
projected scaled subgradient L-BFGS algorithm for optimization — due to its excellent 
performance [16]. 

Gaussian noise: The inference for the regression case follows the same format 
as the binary classification case. The only changes are: 1) replacing Eg [Z] by y 
and skipping updating q{Z). 2) The variational EM algorithm are only applied to the 
observed entries. 
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4.2 Prediction 

Probit noise: Given a missing value index i = (zi, . . . , za'), the predictive distribution 
is 

vivi - i|3^) ~ 

p{yi — l\'mi)p{rni\A4,rj)q{J^)q{r])midA4drj (16) 

The above integral is intractable, so we replace 77 integral q{rj)drj by the mode of its 
approximate posterior distribution r* = (/3i — l)//32> thus the predictive distribution 
is approximated by 

p{yi — l\zi)p{zi\mi)p{mi\M,T*)q{M)dzidmidM 

S{zi>0)M{zi\^li{l),ly?{l))dzi 



where 



K 



k=l 

=k^(Sp + pVl)"'vcc(3^) 
1 

T 



•^fip) = 1 + i) - kT(Sp + pV*i)-ik] 



Gaussian noise: The predictive distribution for the regression case is the following 
integral 



p(yi|3^o) ~ / piyi\mi)p{mi\M,ri)q{M)q{r])midMdr] 



piVi — l\mi)p{mi\M,T*)q{M)dzidmidM 
=AA(zi|Mi(a),^.?(a)). (18) 

4.3 Efficient Computation 

A naive implementation of the above algorithm requires prohibitive OdlfeLi time 
complexity and OdIfcLi "^fe) space complexity for each EM iteration. The key com- 
putation bottlenecks are the operations involving T defined in equation (13). To avoid 
this high complexity, we can make use of the Kronecker product structure. We assume 
Eg [rj] = 1 to simplify the computation, it is easy to adapt our computation strategies 
to Eg [t]] ^ 1. Let S^^) = VC") A^'^^V^^)^ be the singular- value decomposition of the 
covariance matrix S^''^ V*^*^' is an orthogonal matrix and A*^*^^ is a diagonal matrix. 
Y can be represented as 
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aW(i + a(^))-iv(^')t. 

Let V = V(i) (g)...(g) V(-^), A = A(i) (I + A(i))-i ® . . . (g) A^ (I + A^'^^)-\ It is 
obvious that V is an orthogonal matrix and A is a diagonal matrix. The above relation 
implies that we can actually compute the singular value decomposition of Y = VAV^ 
from CO variance matrices 'S^'^K 

In order to efficiently compute tr(Sp ^Y) appearing in equation (15), we use the 
following relations 

tr(S-iY) = tr(S-iV^AV) ^ tr(AVS-iV^) 
= diag(VS;iV^)^diag(A) 
= di (g) . . . (g) dif diag(A) = di (g) . . . (g) dx vec(I?) 
= Vxidi...XKdK, (19) 

where d^ — diagCV^^' (S('''))^^V('')^ )^ with S^'^) being a computed statistics in 
the E-step, diag(A) denotes the diagonal elements of A, and I? is a tensor of size 
rii X . . . X Ufc, such that vec(2?) — diag(A). Both time and space complexities of the 

last formula (19) is 0(nfcLi ^k)- 

We denote V = [V^^), . . . , V(^)] and = [V^^)^, . . . , V(^)^]. For any tensor 
A of the same size as T>, Avec(.4) means multiplying the j-th element of vec(y^) 
by the Ajj, which is the j-th element of vecV. So we have Avec(y^) — vec(I? 
A), where denotes the Hadamard product, i.e. entry-wise product. In light of this 
relation, we can efficiently compute ( 1 2) by 

Y vec(E, [Z]) = vec [((E, [Z] x V^) 2?) x V] . (20) 

The right-hand side of Equation (20) effectively reduce the time and space complexities 
of the left-hand side operations to 0{J2k=i + "'t) Ilfe^i "fe) 0{J2k=i '^fc+ 

nl^i "-fc), respectively. 

We can further reduce the complexities by approximating the covariance matrices 
via truncated SVD. 

5 Related Works 

The InfTucker model extends Probabilistic PCA (PPCA) [20] and Gaussian process 
latent variable models (GPLVMs) [11]: while PPCA and GPLVM model interactions 
of one mode of a matrix and ignore the joint interactions of two modes, InfTucker 
does. Our model is also related to previous matrix-variate GPs Yu et al. [23], Yu and 
Chu [22]. The main difference lies in the fact they used linear covariance functions 
to reduce the computational complexities and dealt with matrix-variate data for online 
recommendation and link prediction. 
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Data 



ammo 



flow injection 



bread 



CP 
TD 



0.053±0.002 
0.054±0.002 
0.053±0.002 
0.057±0.005 
0.054±0.002 
0.049±0.004 
0.047±0.003 
0.047±0.003 



0.051 ±0.005 
0.051 ±0.003 
0.052±0.004 
0.110±0.023 
0.048±0.002 
0.079±0.011 
0.049±0.002 
0.046±0.002 



0.238±0.001 
0.248±0.001 
0.259±0.001 
0.233±0.001 
0.240±0.001 
0.246±0.003 
0.232±0.001 
0.225±0.001 



HOSVD 



NCP 
PTD 
WCP 



InfTucker^P 
InfTuckei^P 



Table 1: The mean square errors (MSB) with standard errors. The results suggested that our 
new approaches-InfTucter*^ and InfTuckei^'' — achieved higher prediction accuracy than all 
the competing approaches. In particular, the improvements of InfTucker'^^ over all the other 
methods on all datasets (except InfTucker^^ on the amino dataset) are statistically significant 
(p < 0.05). 

The most closely related work is the probabilistic Tucker decomposition (pTucker) 
model [5]; actually the GP-based InfTucker reduces to pTucker as a special case when 
using a linear covariance function. Our TP-based InfTucker further differs from pTucker 
by marginalizing out a scaling hyperparameter of the covariance function. Another re- 
lated work is probabilistic high order PCA [24], which is essentially equivalent to a 
linear PPCA after transforming the tensor to a long vector Hoff [7] proposed a hierar- 
chical Bayesian extension to CANDECOMP/PARAFAC that captures the interactions 
of component matrices. Unlike these approaches, ours can handle non-Gaussian noise 
and uses nonlinear covariance functions to model complex interactions. In addition, 
Chu and Ghahramani [5] did not exploits the Kronecker structure of the covariance 
matrices, so it is difficult for pTucker to scale to large datasets and high order tensors; 
and Hoff [7] used a Gibbs sampler for inference — requiring high computational cost 
and making their approach infeasible for tensors with moderate and large sizes. By 
contrast, we provide a deterministic approximate inference method that exploits struc- 
tures in Kronecker products in a variational Bayesian framework, making InfTucker 
much more efficient than competing methods. 

To handle missing data, enhance model interpretability, and avoid overfitting, sev- 
eral extensions (e.g., using nonnegativity constraints) to tensor decomposition have 
been proposed, including nonnegative tensor decomposition (NTD) [17, 8, 13, 15] and 
Weighted tensor decomposition (WTD) [1]. Unlike ours, these models either solve the 
core tensors explicitly, or do not handle nonlinear multiway interactions. 

Finally, note that the inference technique described in Section 4 can be adopted for 
Gaussian process or t-process multi-task learning [4, 26]. Let M be the number of 
tasks and N be the number of data points in each task. Our inference technique can be 
used to reduce their time complexity from 0{M^N^) to 0{M^ + N^) and the space 
complexity from 0{M'^N'^) to OiN"^ + A-P). 
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6 Experiments 



We use InfTuckei^^ and InfTuckei^^ to denote the two new infinite Tucker decom- 
position models based on tensor-variate Gaussian and t processes, respectively. To 
evaluate them, we conducted two sets of experiments, one on continuous tensor data 
and the other on binary tensor data. For both experiments, we compared InfTucker with 
the following tensor decomposition methods: CANDECOMP/PARAFAC (CP), Tucker 
decomposition (TD), Nonnegative CP (NCP), High Order SVD (HOSVD), Weighted 
CP (WCP) and Probabilistic Tucker Decomposition (PTD). We implemented PTD as 
described in the paper by Chu and Ghahramani [5] and applied to a small continuous 
tensor data (bread as described in the 6.1.1). To handle larger and binary datasets, we 
used probit models and the efficient computation techniques described in Section 4.3 
for PTD. For the other methods, we used the implementation of the tensor data analysis 
toolbox- developed by T. G. Kolda. 

6.1 Experiment on continuous tensor data 
6.1.1 Experimental setting 

We used three continuous chemometrics datasets', amino, bread, and flow injection. 
The amino dataset consists of five simple laboratory-made samples. Each sample con- 
tains different amounts of tyrosine, tryptophan and phenylalanine dissolved in phos- 
phate buffered water. The samples were measured by fluorescence (excitation 250-300 
nm, emission 250-450 nm, 1 nm intervals) on a PE LS50B spectrofluorometer with 
excitation slit-width of 2.5 nm, an emission slit-width of 10 nm and a scan-speed 
of 1500 nmls. Thus the dimension of the tensor is 5 x 51 x 201. The bread data 
describes five different breads which were baked in replicates, giving a total of ten 
samples. Eight different judges assessed the breads with respect to eleven different 
attributes in a fixed vocabulary profiling analysis. Hence the dimension of the ten- 
sor is 10 X 11 X 8. The^ovv injection data describes a flow injection analysis (FIA) 
system where a pH-gradient is imposed. In this setup, a carrier stream containing a 
Britton-Robinson buffer of pH 4.5 is continuously injected into the system with a flow 
of 0.375 mL/min. The 77 fiL of sample and 770 /iL of reagent (Britton-Robinson 
buffer pi/ 11.4) are injected simultaneously into the system by a six-port valve and 
the absorbance is detected by a diode-array detector (HP 8452A) from 250 to 450 nm 
in two nanometer intervals. The absorption spectrum is determined every second 89 
times during one injection. Thus this dataset is a 12 (samples) x 100 (wavelengths) x 
89 (times) array. 

All the above tensor data were normalized such that each element of the tensor has 
zero mean and unit variance (based on the vectorized representations). For each ten- 
sor, we randomly split it via 5-fold cross validation: each time four folds are used for 
training and one fold for testing. This procedure was repeated 10 times, each time with 
a different partition for the 5-fold cross validation. In InfTuckefi , the degree of free- 
dom u in the tensor-variate t process is fixed to 10. We chose the Gaussian/exponential 

^http : / /csmr . ca . sandia . gov/ -tgkolda/ Tens or Toolbox/ 
'Available from http : / / www . models . kvl . dk/ datasets 
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covariance functions E'^'^^ (u^, Uj) = e "j" , where t = 1,2 and 7 is selected 

from [0.01 : 0.05 : 1] by 5-fold cross validation. The regularization parameter A for 
InfTuckei^P and InfTucker'^P is chosen from {1, 10, 100}. 

6.1.2 Results 

We compared the the prediction accuracies of all the approaches on hold-out elements 
of the tensor data. For each comparison, we used the same number of latent factors, 
denoted as r, for all the approaches. We varied r from 3 to 5 and computed the av- 
eraged mean square errors (MSEs) and the standard errors of the MSEs. Based on 
cross-validation, we set r = 3. The MSEs on the three datasets are summarized in 
Table 1. Based on the prediction accuracies, PTD and WCP tie on the third best, while 
HOSVD is the worst ( perhaps due to the strong nonnegativity constraint on the latent 
factors). Clearly, InfTuckei^^ achieved higher prediction accuracies than all the previ- 
ous approaches on all the datasets, and InfTuckefi further outperformed InfTuckei^^ 
for most cases. 

6.2 Experiment on binary tensor data 

6.2.1 Experimental setting 

We extracted three binary social network datasets, Enron, Diggl, and Diggl, for our 
experimental evaluation. Enron is a relational dataset describing the three-way rela- 
tionship: sender-receiver-email. This dataset, extracted from the Enron email dataset"*, 
has the dimensionality of 203 x 203 x 200 with 0.01% non-zero elements. The 
Diggl and Digg2 datasets were all extracted from a social news website digg . com^. 
Diggl describes a three-way interaction: news-keyword-topic, and Digg2 describes 
a four-way interaction: user-news-keyword-topic. Diggl has the dimensionality of 
581 X 124 X 48 with 0.024% non-zero elements, and Diggl has the dimensionality of 
22 X 109 X 330 x 30 with only 0.002% non-zero elements. Apparently these datasets 
are very sparse. 

6.2.2 Results 

We chose r from the range {3,5,8,10,15,20} based on cross-validation. Since the data 
are binary, we evaluated all these approaches by area-under-curve (AUC) values aver- 
aged over 50 runs. The larger the averaged AUC value an approach achieves, the better 
it is. We reported the averaged AUC values for all algorithms in Figure 1. Again, 
the proposed InfTuckei^^ and InfTucker*^ approaches significantly outperform all 
the others. Note that the nonprobabilistic approaches — such as CP and TD — suffer 
severely from the least square minimization; given the sparse and binary training data, 
the least-square-minimization leads to too many predictions with zero values, a result 

Available at http : / / www . cs . emu . edu/ ~enron/ 

^Available at http : / / www . public . asu . edu / -ylinS 5/kddO 9sup . html 
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Figure 1: The area under curve (AUC) values of six algorithms on three multi-way 
networks. The dimensions of the latent factors are r = 3, 5, 8, 10, 15, 20 respectively. 
The proposed InfTucker models performed significantly better than the other methods. 



of both overfitting and mis-model fitting. This experimental comparison fully demon- 
strates the advantages of InfTucker (stemming from the right noise models and the 
nonparametric Bayesian treatment). 
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7 Conclusion 



To conduct multiway data analysis, we have proposed a new nonparametric Bayesian 

tensor decomposition framework, InfTucker, where the observed tensor is modeled 
as a sample from a stochastic processes on tensors. In particular, we have employed 
tensor-variate Gaussian and t processes. This new framework can model nonUnear in- 
teractions between multi-aspects of the tensor data, handle missing data and noise, and 
quantify prediction confidence (based on predictive posterior distributions). We have 
also presented an efficient variational method to estimate InfTucker from data. Ex- 
perimental results on chemometrics and social network datasets demonstrated that the 
superior predictive performance of InfTucker over the alternative tensor decomposition 
approaches. 
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A Proof of Theorem 2 

Proof Sketch If - r7V(0, {IJ^^J, then vec(>V('')) - A/'(0,W). Let 

U'^''^ k = 1, . . . ,K he K location sets (matrices) as used in (4), we have 

vec(w('') X (/.M(Z^)) =(/)M(U(i))(g)...(^0('')(U(^))vec(W(''^) (21) 

Thus, vec(wW x (jj^^'HU)) - AA(i/,0,sf^' sf^"^'), where = 

Er*^-* (u^'^-' , u^*^-* ) are the covariance matrix. Inverting (21) gives W^'''^ x (j)^''\U) ~ 

rA^(0,{sl''^}f^i), which proves i'*"' follows the tensor Gaussian process. 

From the definition of inner product in £2, we have the following identity on the 
convergence of covariance function. 

I]('^-)(uf\uf)) = lim S«(uf\uW), Vuf ,uf) G U, 
Convergence in distribution follows from this convergence result. 



B Gradient of /(^Y) 



df n 

- — tr 



(^(SW)-i^j +rM^AW/x + rtr(AWT) (22) 



(g)(S('=+i))-i(»...(g)(S(-f^))-i 
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